arXiv:1506.05751v1 [cs.CV] 18 Jun 2015


Deep Generative Image Models using a 
Laplacian Pyramid of Adversarial Networks 


Emily Denton* 

Dept. of Computer Science
Courant Institute 
New York University 


Soumith Chintala* Arthur Szlam Rob Fergus 

Facebook AI Research 
New York 


Abstract 


In this paper we introduce a generative parametric model capable of producing
high quality samples of natural images. Our approach uses a cascade of convolutional
networks within a Laplacian pyramid framework to generate images in
a coarse-to-fine fashion. At each level of the pyramid, a separate generative convnet
model is trained using the Generative Adversarial Nets (GAN) approach [10].

Samples drawn from our model are of significantly higher quality than those of alternative
approaches. In a quantitative assessment by human evaluators, our CIFAR10 samples
were mistaken for real images around 40% of the time, compared to 10% for
samples drawn from a GAN baseline model. We also show samples from models
trained on the higher resolution images of the LSUN scene dataset.

1 Introduction 

Building a good generative model of natural images has been a fundamental problem within computer
vision. However, images are complex and high dimensional, making them hard to model
well, despite extensive efforts. Given the difficulties of modeling entire scenes at high resolution,
most existing approaches instead generate image patches. In contrast, in this work, we propose
an approach that is able to generate plausible looking scenes at 32 x 32 and 64 x 64. To do this,
we exploit the multi-scale structure of natural images, building a series of generative models, each
of which captures image structure at a particular scale of a Laplacian pyramid [1]. This strategy
breaks the original problem into a sequence of more manageable stages. At each scale we train a
convolutional network-based generative model using the Generative Adversarial Networks (GAN)
approach of Goodfellow et al. [10]. Samples are drawn in a coarse-to-fine fashion, commencing
with a low-frequency residual image. The second stage samples the band-pass structure at the next
level, conditioned on the sampled residual. Subsequent levels continue this process, always conditioning
on the output from the previous scale, until the final level is reached. Thus drawing samples
is an efficient and straightforward procedure: taking random vectors as input and running forward
through a cascade of deep convolutional networks (convnets) to produce an image.

Deep learning approaches have proven highly effective at discriminative tasks in vision, such as 
object classification 0. However, the same level of success has not been obtained for generative 
tasks, despite numerous efforts [13, 24, 28]. Against this background, our proposed approach makes
a significant advance in that it is straightforward to train and sample from, with the resulting samples 
showing a surprising level of visual fidelity, indicating a better density model than prior methods. 

1.1 Related Work 

Generative image models are well studied, falling into two main approaches: non-parametric and 
parametric. The former copy patches from training images to perform, for example, texture synthesis 
OS or super-resolution 0. More ambitiously, entire portions of an image can be in-painted, given a 
sufficiently large training dataset [12]. Early parametric models addressed the easier problem of texture
synthesis EEDEQ), with Portilla & Simoncelli [20] making use of a steerable pyramid wavelet
representation [25], similar to our use of a Laplacian pyramid. For image processing tasks, models
based on marginal distributions of image gradients are effective [18, 23], but are only designed for
image restoration rather than being true density models (so cannot sample an actual image). Very
large Gaussian mixture models [32] and sparse coding models of image patches [29] can also be
used but suffer the same problem.

* denotes equal contribution.

A wide variety of deep learning approaches involve generative parametric models. Restricted Boltzmann
machines CCD [EMU ED, Deep Boltzmann machines [24, 7] and denoising auto-encoders [28]
all have a generative decoder that reconstructs the image from the latent representation. Variational
auto-encoders [15, 22] provide a probabilistic interpretation which facilitates sampling. However, for
all these methods convincing samples have only been shown on simple datasets such as MNIST
and NORB, possibly due to training complexities which limit their applicability to larger and more
realistic images.

Several recent papers have proposed novel generative models. Dosovitskiy et al. Q showed how a 
convnet can draw chairs with different shapes and viewpoints. While our model also makes use of 
convnets, it is able to sample general scenes and objects. The DRAW model of Gregor et al. [11]
used an attentional mechanism with an RNN to generate images via a trajectory of patches, showing
samples of MNIST and CIFAR10 images. Sohl-Dickstein et al. [26] use a diffusion-based process
for deep unsupervised learning and the resulting model is able to produce reasonable CIFAR10 samples.
Theis and Bethge [27] employ LSTMs to capture spatial dependencies and show convincing
inpainting results of natural textures.

Our work builds on the GAN approach of Goodfellow et al. [10] which works well for smaller
images (e.g. MNIST) but cannot directly handle large ones, unlike our method. Most relevant to our
approach is the preliminary work of Mirza and Osindero [17] and Gauthier [9] who both propose
conditional versions of the GAN model. The former shows MNIST samples, while the latter focuses
solely on frontal face images. Our approach also uses several forms of conditional GAN model but
is much more ambitious in its scope.

2 Approach 

The basic building block of our approach is the generative adversarial network (GAN) of Goodfellow
et al. [10]. After reviewing this, we introduce our LAPGAN model which integrates a conditional
form of GAN model into the framework of a Laplacian pyramid.

2.1 Generative Adversarial Networks 

The GAN approach [10] is a framework for training generative models, which we briefly explain in
the context of image data. The method pits two networks against one another: a generative model G
that captures the data distribution and a discriminative model D that distinguishes between samples
drawn from G and images drawn from the training data. In our approach, both G and D are convolutional
networks. The former takes as input a noise vector z drawn from a distribution $p_{\text{Noise}}(z)$ and
outputs an image $\tilde{h}$. The discriminative network D takes an image as input, stochastically chosen
(with equal probability) to be either $\tilde{h}$, as generated from G, or $h$, a real image drawn from the
training data $p_{\text{Data}}(h)$. D outputs a scalar probability, which is trained to be high if the input was
real and low if generated from G. A minimax objective is used to train both models together:

$$\min_G \max_D \; \mathbb{E}_{h \sim p_{\text{Data}}(h)}[\log D(h)] + \mathbb{E}_{z \sim p_{\text{Noise}}(z)}[\log(1 - D(G(z)))] \qquad (1)$$

This encourages G to fit $p_{\text{Data}}(h)$ so as to fool D with its generated samples $\tilde{h}$. Both G and D
are trained by backpropagating the loss in Eqn. 1 through their respective models to update the
parameters.
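
As an illustration of the objective in Eqn. 1, the following minimal sketch estimates its value from a batch of discriminator outputs. The function name `gan_value` and the clipping constant `eps` are assumptions made for this example and do not come from the paper; the networks themselves are not modeled, only their scalar outputs.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of the minimax objective in Eqn. 1.

    d_real: outputs D(h) on a batch of real images, each in (0, 1).
    d_fake: outputs D(G(z)) on a batch of generated images, each in (0, 1).
    eps:    hypothetical clipping constant that keeps the logs finite.
    """
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# A discriminator that cannot tell real from generated outputs 0.5 everywhere.
undecided = gan_value([0.5, 0.5], [0.5, 0.5])
# A confident and correct discriminator pushes the value towards 0 from below.
confident = gan_value([0.99, 0.98], [0.01, 0.02])
```

D is updated to increase this quantity while G is updated to decrease it, which is why a discriminator stuck at 0.5 yields the lower value 2 log(0.5).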

The conditional generative adversarial net (CGAN) is an extension of the GAN where both networks 
G and D receive an additional vector of information l as input. This might contain, say, information 
about the class of the training example h. The loss function thus becomes 

$$\min_G \max_D \; \mathbb{E}_{h,l \sim p_{\text{Data}}(h,l)}[\log D(h,l)] + \mathbb{E}_{z \sim p_{\text{Noise}}(z),\, l \sim p_l(l)}[\log(1 - D(G(z,l),l))] \qquad (2)$$

where $p_l(l)$ is, for example, the prior distribution over classes. This model allows the output of
the generative model to be controlled by the conditioning variable $l$. Mirza and Osindero [17] and
Gauthier [9] both explore this model with experiments on MNIST and faces, using $l$ as a class
indicator. In our approach, $l$ will be another image, generated from another CGAN model.

2.2 Laplacian Pyramid 

The Laplacian pyramid [1] is a linear invertible image representation consisting of a set of band-pass
images, spaced an octave apart, plus a low-frequency residual. Formally, let d(.) be a downsampling
operation which blurs and decimates a j x j image I, so that d(I) is a new image of size j/2 x j/2.
Also, let u(.) be an upsampling operator which smooths and expands I to be twice the size, so u(I)
is a new image of size 2j x 2j. We first build a Gaussian pyramid $\mathcal{G}(I) = [I_0, I_1, \ldots, I_K]$, where
$I_0 = I$ and $I_k$ is k repeated applications* of d(.) to I. K is the number of levels in the pyramid,
selected so that the final level has very small spatial extent (≤ 8 x 8 pixels).

The coefficients $h_k$ at each level k of the Laplacian pyramid $\mathcal{L}(I)$ are constructed by taking the
difference between adjacent levels in the Gaussian pyramid, upsampling the smaller one with u(.)
so that the sizes are compatible:

$$h_k = \mathcal{L}_k(I) = \mathcal{G}_k(I) - u(\mathcal{G}_{k+1}(I)) = I_k - u(I_{k+1}) \qquad (3)$$

Intuitively, each level captures image structure present at a particular scale. The final level of the
Laplacian pyramid $h_K$ is not a difference image, but a low-frequency residual equal to the final
Gaussian pyramid level, i.e. $h_K = I_K$. Reconstruction from the Laplacian pyramid coefficients
$[h_0, \ldots, h_K]$ is performed using the backward recurrence:

$$I_k = u(I_{k+1}) + h_k \qquad (4)$$

which is started with $I_K = h_K$, the reconstructed image being $I = I_0$. In other words, starting
at the coarsest level, we repeatedly upsample and add the difference image h at the next finer level
until we get back to the full resolution image.
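
The construction and the backward recurrence can be sketched in a few lines of NumPy. This is only an illustration: the paper does not specify the exact filters, so d(.) is assumed here to be a 2 x 2 block average and u(.) a nearest-neighbour expansion, and all function names are hypothetical. Invertibility holds for any choice of d and u because each level stores exactly what the upsampled coarser level misses.

```python
import numpy as np

def downsample(img):
    """d(.): assumed to be a 2x2 block average (a crude blur + decimate)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """u(.): assumed to be nearest-neighbour expansion to twice the size."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def gaussian_pyramid(img, num_levels):
    """Return [I_0, ..., I_K] with K = num_levels."""
    levels = [img]
    for _ in range(num_levels):
        levels.append(downsample(levels[-1]))
    return levels

def laplacian_pyramid(img, num_levels):
    """Return [h_0, ..., h_K]; h_k = I_k - u(I_{k+1}) and h_K = I_K."""
    g = gaussian_pyramid(img, num_levels)
    coeffs = [g[k] - upsample(g[k + 1]) for k in range(num_levels)]
    coeffs.append(g[num_levels])
    return coeffs

def reconstruct(coeffs):
    """Backward recurrence I_k = u(I_{k+1}) + h_k, started from I_K = h_K."""
    img = coeffs[-1]
    for h_k in reversed(coeffs[:-1]):
        img = upsample(img) + h_k
    return img

rng = np.random.default_rng(0)
image = rng.random((64, 64))          # stand-in for a grayscale 64 x 64 image
pyramid = laplacian_pyramid(image, num_levels=3)
restored = reconstruct(pyramid)
```

With a 64 x 64 input and K = 3 the levels have sizes 64, 32, 16 and 8, so the final level meets the 8 x 8 criterion above.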

2.3 Laplacian Generative Adversarial Networks (LAPGAN) 

Our proposed approach combines the conditional GAN model with a Laplacian pyramid representation.
The model is best explained by first considering the sampling procedure. Following training
(explained below), we have a set of generative convnet models $\{G_0, \ldots, G_K\}$, each of which captures
the distribution of coefficients $h_k$ for natural images at a different level of the Laplacian pyramid.
Sampling an image is akin to the reconstruction procedure in Eqn. 4, except that the generative
models are used to produce the $\tilde{h}_k$'s:

$$\tilde{I}_k = u(\tilde{I}_{k+1}) + \tilde{h}_k = u(\tilde{I}_{k+1}) + G_k(z_k, u(\tilde{I}_{k+1})) \qquad (5)$$

The recurrence starts by setting $\tilde{I}_{K+1} = 0$ and using the model at the final level $G_K$ to generate a
residual image $\tilde{I}_K$ using noise vector $z_K$: $\tilde{I}_K = G_K(z_K)$. Note that models at all levels except the
final are conditional generative models that take an upsampled version of the current image $\tilde{I}_{k+1}$ as
a conditioning variable, in addition to the noise vector $z_k$. Fig. 1 shows this procedure in action for
a pyramid with K = 3 using 4 generative models to sample a 64 x 64 image.
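
The sampling recurrence of Eqn. 5 can be sketched as a short loop. The generators below are toy placeholders rather than trained convnets, the nearest-neighbour `upsample` is an assumed stand-in for u(.), and feeding the coarsest generator an 8 x 8 noise plane is a simplification made for this example; all names are hypothetical.

```python
import numpy as np

def upsample(img):
    """u(.): assumed to be nearest-neighbour expansion to twice the size."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def sample_lapgan(generators, coarse_size, rng):
    """Draw one image; generators[k] plays the role of G_k, generators[-1] of G_K."""
    K = len(generators) - 1
    z = rng.uniform(-1.0, 1.0, size=(coarse_size, coarse_size))
    image = generators[K](z)                           # I~_K = G_K(z_K)
    for k in range(K - 1, -1, -1):
        low_pass = upsample(image)                     # u(I~_{k+1})
        z = rng.uniform(-1.0, 1.0, size=low_pass.shape)
        image = low_pass + generators[k](z, low_pass)  # Eqn. 5
    return image

def toy_coarse_generator(z):
    """Placeholder for the unconditional G_K (not a trained model)."""
    return 0.5 * z

def toy_conditional_generator(z, low_pass):
    """Placeholder for a conditional G_k; a real model would use low_pass."""
    return 0.1 * z

rng = np.random.default_rng(0)
generators = [toy_conditional_generator] * 3 + [toy_coarse_generator]
sample = sample_lapgan(generators, coarse_size=8, rng=rng)
```

With K = 3 and an 8 x 8 starting level, the loop doubles the resolution three times and returns a 64 x 64 array, matching the configuration described for Fig. 1.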

The generative models $\{G_0, \ldots, G_K\}$ are trained using the CGAN approach at each level of the
pyramid. Specifically, we construct a Laplacian pyramid from each training image I. At each level
* i.e. $I_2 = d(d(I))$.



Figure 1: The sampling procedure for our LAPGAN model. We start with a noise sample $z_3$ (right side) and
use a generative model $G_3$ to generate $\tilde{I}_3$. This is upsampled (green arrow) and then used as the conditioning
variable (orange arrow) $l_2$ for the generative model at the next level, $G_2$. Together with another noise sample
$z_2$, $G_2$ generates a difference image $\tilde{h}_2$ which is added to $l_2$ to create $\tilde{I}_2$. This process repeats across two
subsequent levels to yield a final full resolution sample $I_0$.

Figure 2: The training procedure for our LAPGAN model. Starting with a 64x64 input image I from our
training set (top left): (i) we take $I_0 = I$ and blur and downsample it by a factor of two (red arrow) to produce
$I_1$; (ii) we upsample $I_1$ by a factor of two (green arrow), giving a low-pass version $l_0$ of $I_0$; (iii) with equal
probability we use $l_0$ to create either a real or a generated example for the discriminative model $D_0$. In the real
case (blue arrows), we compute high-pass $h_0 = I_0 - l_0$ which is input to $D_0$ that computes the probability of
it being real vs generated. In the generated case (magenta arrows), the generative network $G_0$ receives as input
a random noise vector $z_0$ and $l_0$. It outputs a generated high-pass image $\tilde{h}_0 = G_0(z_0, l_0)$, which is input to
$D_0$. In both the real/generated cases, $D_0$ also receives $l_0$ (orange arrow). Optimizing Eqn. 2, $G_0$ thus learns
to generate realistic high-frequency structure $\tilde{h}_0$ consistent with the low-pass image $l_0$. The same procedure is
repeated at scales 1 and 2, using $I_1$ and $I_2$. Note that the models at each level are trained independently. At
level 3, $I_3$ is an 8x8 image, simple enough to be modeled directly with a standard GAN $G_3$ & $D_3$.


we make a stochastic choice (with equal probability) to either (i) construct the coefficients $h_k$
using the standard procedure from Eqn. 3, or (ii) generate them using $G_k$:


$$\tilde{h}_k = G_k(z_k, u(I_{k+1})) \qquad (6)$$

Note that $G_k$ is a convnet which uses a coarse scale version of the image $l_k = u(I_{k+1})$ as an input,
as well as noise vector $z_k$. $D_k$ takes as input $h_k$ or $\tilde{h}_k$, along with the low-pass image $l_k$ (which is
explicitly added to $h_k$ or $\tilde{h}_k$ before the first convolution layer), and predicts if the image was real or
generated. At the final scale of the pyramid, the low frequency residual is sufficiently small that it
can be directly modeled with a standard GAN: $\tilde{h}_K = G_K(z_K)$ and $D_K$ only has $h_K$ or $\tilde{h}_K$ as input.
The framework is illustrated in Fig. 2.
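
The way a single training example for the discriminator at level k is assembled can be sketched as follows. This is a simplified illustration: d(.) and u(.) are assumed to be a block average and a nearest-neighbour expansion, `toy_generator` is an untrained placeholder, and every name is hypothetical rather than taken from the released Torch code.

```python
import numpy as np

def downsample(img):
    """d(.): assumed to be a 2x2 block average."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """u(.): assumed to be nearest-neighbour expansion to twice the size."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def discriminator_example(image_k, generator, rng):
    """Return (high_pass, low_pass, label) for one training example of D_k.

    label is 1 for a real high-pass image and 0 for a generated one.
    """
    low_pass = upsample(downsample(image_k))      # l_k = u(I_{k+1})
    if rng.random() < 0.5:
        high_pass = image_k - low_pass            # real h_k, as in Eqn. 3
        label = 1
    else:
        z = rng.uniform(-1.0, 1.0, size=low_pass.shape)
        high_pass = generator(z, low_pass)        # generated h~_k
        label = 0
    return high_pass, low_pass, label

def toy_generator(z, low_pass):
    """Placeholder for G_k (not a trained model)."""
    return 0.1 * z

rng = np.random.default_rng(1)
image = rng.random((16, 16))
example = discriminator_example(image, toy_generator, rng)
```

In both branches the discriminator also receives the low-pass image, which is why it is returned alongside the high-pass input.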

Breaking the generation into successive refinements is the key idea in this work. Note that we give 
up any “global” notion of fidelity; we never make any attempt to train a network to discriminate 
between the output of a cascade and a real image and instead focus on making each step plausible. 
Furthermore, the independent training of each pyramid level has the advantage that it is far more 
difficult for the model to memorize training examples - a hazard when high capacity deep networks 
are used. 

As described, our model is trained in an unsupervised manner. However, we also explore variants 
that utilize class labels. This is done by adding a 1-hot vector c, indicating class identity, as another
conditioning variable for $G_k$ and $D_k$.

3 Model Architecture & Training 

We apply our approach to three datasets: (i) CIFAR10 - 32x32 pixel color images of 10 different 
classes, 100k training samples with tight crops of objects; (ii) STL - 96x96 pixel color images of 
10 different classes, 100k training samples (we use the unlabeled portion of data); and (iii) LSUN 
El 10M images of 10 different natural scene types, downsampled to 64x64 pixels. 

For each dataset, we explored a variety of architectures for $G_k$ and $D_k$. We now detail the best
performing models, selected using a combination of log-likelihood and visual appearance of the
samples. Complete Torch specification files for all models are provided in supplementary material
[4]. For all models, the noise vector $z_k$ is drawn from a uniform [-1,1] distribution.

3.1 CIFAR10 and STL

Initial scale: This operates at 8 x 8 resolution, using densely connected nets for both $G_K$ & $D_K$
with 2 hidden layers and ReLU non-linearities. $D_K$ uses Dropout and has 600 units/layer vs 1200
for $G_K$. $z_K$ is a 100-d vector.

Subsequent scales: For CIFAR10, we boost the training set size by taking four 28 x 28 crops from
the original images. Thus the two subsequent levels of the pyramid are 8 → 14 and 14 → 28. For
STL, we have 4 levels going from 8 → 16 → 32 → 64 → 96. For both datasets, $G_k$ & $D_k$ are
convnets with 3 and 2 layers, respectively (see [4]). The noise input $z_k$ to $G_k$ is presented as a 4th
“color plane” to low-pass $l_k$, hence its dimensionality varies with the pyramid level. For CIFAR10,
we also explore a class conditional version of the model, where a vector c encodes the label. This is
integrated into $G_k$ & $D_k$ by passing it through a linear layer whose output is reshaped into a single
plane feature map which is then concatenated with the 1st layer maps. The loss in Eqn. 2 is trained
using SGD with an initial learning rate of 0.02, decreased by a factor of (1 + 4 x 10^) at each
epoch. Momentum starts at 0.5, increasing by 0.0008 at each epoch up to a maximum of 0.8. During
training, we monitor log-likelihood using a Parzen-window estimator and retain the best performing
model. Training time depends on the model size and pyramid level, with smaller models taking
hours to train and larger models taking several days.
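
The learning rate and momentum schedule can be written out explicitly, as in the sketch below. Two points are assumptions rather than facts from the text: the exponent of the learning-rate decay factor is not legible in this copy, so `lr_decay` is a hypothetical placeholder value, and "decreased by a factor" is read here as a division. The function name is also hypothetical.

```python
def sgd_schedule(num_epochs, lr0=0.02, lr_decay=1.0 + 4e-5,
                 m0=0.5, m_step=0.0008, m_max=0.8):
    """Return a list of (epoch, learning_rate, momentum) tuples.

    lr_decay is a placeholder: the true exponent is not recoverable from the text.
    """
    schedule = []
    lr, momentum = lr0, m0
    for epoch in range(num_epochs):
        schedule.append((epoch, lr, momentum))
        lr = lr / lr_decay                        # assumed reading of "decreased by a factor"
        momentum = min(momentum + m_step, m_max)  # linear ramp, capped at m_max
    return schedule

schedule = sgd_schedule(num_epochs=500)
```

Under the stated momentum values the ramp from 0.5 to 0.8 takes about (0.8 - 0.5) / 0.0008 = 375 epochs, after which momentum stays at its maximum.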

3.2 LSUN 

The larger size of this dataset allows us to train a separate LAPGAN model for each of the 10 different
scene classes. During evaluation, so that we may understand the variation captured by our models,
we commence the sampling process with validation set images† downsampled to 4 x 4 resolution.

The four subsequent scales 4 → 8 → 16 → 32 → 64 use a common architecture for $G_k$ & $D_k$ at
each level. $G_k$ is a 5-layer convnet with {64, 368, 128, 224} feature maps and a linear output layer.
7x7 filters, ReLUs, batch normalization m and Dropout are used at each hidden layer. $D_k$ has
3 hidden layers with {48, 448, 416} maps plus a sigmoid output. See [4] for full details. Note that
$G_k$ and $D_k$ are substantially larger than those used for CIFAR10 and STL, as afforded by the larger
training set.

4 Experiments 

We evaluate our approach using 3 different methods: (i) computation of log-likelihood on a held 
out image set; (ii) drawing sample images from the model and (iii) a human subject experiment that 
compares (a) our samples, (b) those of baseline methods and (c) real images. 

4.1 Evaluation of Log-Likelihood 

A traditional method for evaluating generative models is to measure their log-likelihood on a held
out set of images. But, like the original GAN method [10], our approach does not have a direct
way of computing the probability of an image. Goodfellow et al. [10] propose using a Gaussian
Parzen window estimate to compute log-likelihoods. Despite showing poor performance in high
dimensional spaces, this approach is the best one available for estimating likelihoods of models
lacking an explicitly represented density function.
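
For readers unfamiliar with the estimator, a Gaussian Parzen window fits an isotropic Gaussian of width sigma around every model sample and evaluates held-out points under the resulting mixture. The sketch below is a generic implementation of that idea, not the code used for the reported numbers; the function name and array layout are assumptions.

```python
import numpy as np

def parzen_log_likelihood(test_points, samples, sigma):
    """Log-density of each test point under a Gaussian Parzen window.

    test_points: array of shape (m, d), held-out data (vectorized images).
    samples:     array of shape (n, d), samples drawn from the model.
    sigma:       kernel width, normally chosen on a validation set.
    """
    test_points = np.asarray(test_points, dtype=float)
    samples = np.asarray(samples, dtype=float)
    n, d = samples.shape
    diff = test_points[:, None, :] - samples[None, :, :]
    exponents = -np.sum(diff ** 2, axis=2) / (2.0 * sigma ** 2)
    # Log-sum-exp over the n kernels, shifted by the maximum for stability.
    max_exp = exponents.max(axis=1, keepdims=True)
    log_sum = max_exp[:, 0] + np.log(np.exp(exponents - max_exp).sum(axis=1))
    log_norm = np.log(n) + 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
    return log_sum - log_norm

# Single kernel centred on the test point, d = 2, sigma = 1:
# the log-density equals -log(2 * pi).
log_p = parzen_log_likelihood(np.zeros((1, 2)), np.zeros((1, 2)), sigma=1.0)
```

The pairwise difference array has m x n x d entries, so for tens of thousands of samples the test points would need to be processed in batches.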

Our LAPGAN model allows for an alternative method of estimating log-likelihood that exploits the
multi-scale structure of the model. This new approach uses a Gaussian Parzen window estimate to
compute a probability at each scale of the Laplacian pyramid. We use this procedure, described in
detail in Appendix A, to compute the log-likelihoods for CIFAR10 and STL images (both at 32 x 32
resolution). The parameter σ (controlling the Parzen window size) was chosen using the validation
set. We also compute the Parzen window based log-likelihood estimates of the standard GAN [10]
model, using 50k samples for both the CIFAR10 and STL estimates. Table 1 shows our model
achieving a significantly higher log-likelihood on both datasets. Comparisons to further approaches,
notably [26], are problematic due to different normalizations used on the data.

4.2 Model Samples 

We show samples from models trained on CIFAR10, STL and LSUN datasets. Additional samples
can be found in the supplementary material [4].


† These were not used in any way during training.

Model       CIFAR10         STL (@32x32)
GAN [10]    -3617 ± 353     -3661 ± 347
LAPGAN      -1799 ± 826     -2906 ± 728

Table 1: Parzen window based log-likelihood estimates for a standard GAN and our proposed LAPGAN
model on the CIFAR10 and STL datasets.


Fig. 3 shows samples from our models trained on CIFAR10. Samples from the class conditional
LAPGAN are organized by class. Our reimplementation of the standard GAN model [10] produces
slightly sharper images than those shown in the original paper. We attribute this improvement to
the introduction of data augmentation. The LAPGAN samples improve upon the standard GAN
samples. They appear more object-like and have more clearly defined edges. Conditioning on a
class label improves the generations as evidenced by the clear object structure in the conditional
LAPGAN samples. The quality of these samples compares favorably with those from the DRAW
model of Gregor et al. [11] and also Sohl-Dickstein et al. [26]. The rightmost column of each image
shows the nearest training example to the neighboring sample (in L2 pixel-space). This demonstrates
that our model is not simply copying the input examples.


Fig. 4(a) shows samples from our LAPGAN model trained on STL. Here, we lose clear object shape 
but the samples remain sharp. Fig. 4(b) shows the generation chain for random STL samples.


Fig. 5 shows samples from LAPGAN models trained on three LSUN categories (tower, bedroom,
church front). The 4x4 validation image used to start the generation process is shown in the first
column, along with 10 different 64 x 64 samples, which illustrate the inherent variation captured
by the model. Collectively, these show the models capturing long-range structure within the scenes,
being able to recompose scene elements into credible looking images. To the best of our knowledge,
no other generative model has been able to produce samples of this complexity. The substantial
gain in quality over the CIFAR10 and STL samples is likely due to the much larger LSUN
training set which allowed us to train bigger and deeper models.


4.3 Human Evaluation of Samples 

To obtain a quantitative measure of quality of our samples, we asked 15 volunteers to participate
in an experiment to see if they could distinguish our samples from real images. The subjects were
presented with the user interface shown in Fig. 6 (right) and shown at random four different types
of image: samples drawn from three different GAN models trained on CIFAR10 ((i) LAPGAN, (ii)
class conditional LAPGAN and (iii) standard GAN [10]) and also real CIFAR10 images. After being
presented with the image, the subject clicked the appropriate button to indicate if they believed the
image was real or generated. Since accuracy is a function of viewing time, we also randomly pick the
presentation time from one of 11 durations ranging from 50ms to 2000ms, after which a gray mask
image is displayed. Before the experiment commenced, the subjects were shown examples of real images
from CIFAR10. After collecting ~10k samples from the volunteers, we plot in Fig. 6 the fraction of
images believed to be real for the four different data sources, as a function of presentation time. The
curves show our models produce samples that are far more realistic than those from standard GAN
[10].


5 Discussion 

By modifying the approach in [10] to better respect the structure of images, we have proposed a
conceptually simple generative model that is able to produce high-quality sample images that are
both qualitatively and quantitatively better than other deep generative modeling approaches. A key
point in our work is giving up any “global” notion of fidelity, and instead breaking the generation
into plausible successive refinements. We note that many other signal modalities have a multiscale
structure that may benefit from a similar approach.

Figure 3: CIFAR10 samples: our class conditional CC-LAPGAN model, our LAPGAN model and
the standard GAN model of Goodfellow [10]. The yellow column shows the training set nearest
neighbors of the samples in the adjacent column.




Figure 4: STL samples: (a) Random 96x96 samples from our LAPGAN model, (b) Coarse-to-fine
generation chain.

Figure 5: 64 x 64 samples from three different LSUN LAPGAN models (top: tower, middle: bedroom,
bottom: church front). The first column shows the 4 x 4 validation set image used to start the
generation process, with subsequent columns showing different draws from the model.

Figure 6: Left: Human evaluation of real CIFAR10 images (red) and samples from Goodfellow
et al. [10] (magenta), our LAPGAN (blue) and a class conditional LAPGAN (green). The error
bars show ±1σ of the inter-subject variability. Around 40% of the samples generated by our class
conditional LAPGAN model are realistic enough to fool a human into thinking they are real images.
This compares with ≤ 10% of images from the standard GAN model [10] but is still a lot lower
than the > 90% rate for real images. Right: The user-interface presented to the subjects.


Appendix A 


To describe the log-likelihood computation in our model, let us consider a two scale pyramid for
the moment. Given a (vectorized) j x j image $I$, denote by $l = d(I)$ the coarsened image, and
$h = I - u(d(I))$ to be the high pass. In this section, to simplify the computations, we use a slightly
different u operator than the one used to generate the images displayed in Fig. 3. Namely, here we
take d(I) to be the mean over each disjoint block of 2 x 2 pixels, and take u to be the operator that
removes the mean from each 2 x 2 block. Since u has rank $3d^2/4$, in this section, we write h in an
orthonormal basis of the range of u; then the (linear) mapping from $I$ to $(l, h)$ is unitary. We now
build a probability density p on $\mathbb{R}^{d^2}$ by

$$p(I) = q_0(l, h)\, q_1(l) = q_0(d(I), h(I))\, q_1(d(I));$$


in a moment we will carefully define the functions $q_i$. For now, suppose that $q_i \geq 0$, $\int q_1(l)\, dl = 1$,
and for each fixed $l$, $\int q_0(l, h)\, dh = 1$. Then we can check that p has unit integral:


$$\int p\, dI = \int q_0(d(I), h(I))\, q_1(d(I))\, dI = \int\!\!\int q_0(l, h)\, q_1(l)\, dl\, dh = 1.$$


Now we define the $q_i$ with Parzen window approximations to the densities of each of the scales.
For $q_1$, we take a set of training samples $l_1, \ldots, l_{N_1}$, and construct the density function
$q_1(l) \sim \sum_{i=1}^{N_1} e^{-\|l - l_i\|^2/\sigma_1}$. We fix $l = d(I)$ to define
$q_0(I) = q_0(l, h) \sim \sum_{i=1}^{N_0} e^{-\|h - h_i\|^2/\sigma_0}$, where the $h_i$ are high pass samples. For pyramids
with more levels, we continue in the same way for each of the finer scales. Note we always use the
true low pass at each scale, and measure the true high pass against the high pass samples generated
from the model. Thus for a pyramid with K levels, the final log likelihood will be:

$$\log(q_K(l_K)) + \sum_{k=0}^{K-1} \log(q_k(l_k, h_k)).$$